Data Preprocessing


Deep Learning, Machine Learning, Advancing Big Data Analytics and Management

Hsieh, Weiche, Bi, Ziqian, Chen, Keyu, Peng, Benji, Zhang, Sen, Xu, Jiawei, Wang, Jinlang, Yin, Caitlyn Heqi, Zhang, Yichao, Feng, Pohsun, Wen, Yizhu, Wang, Tianyang, Li, Ming, Liang, Chia Xin, Ren, Jintao, Niu, Qian, Chen, Silin, Yan, Lawrence K. Q., Xu, Han, Tseng, Hong-Ming, Song, Xinyuan, Jing, Bowen, Yang, Junjie, Song, Junhao, Liu, Junyu, Liu, Ming

arXiv.org Artificial Intelligence

Advancements in artificial intelligence, machine learning, and deep learning have catalyzed the transformation of big data analytics and management into pivotal domains for research and application. This work explores the theoretical foundations, methodological advancements, and practical implementations of these technologies, emphasizing their role in uncovering actionable insights from massive, high-dimensional datasets. The study presents a systematic overview of data preprocessing techniques, including data cleaning, normalization, integration, and dimensionality reduction, to prepare raw data for analysis. Core analytics methodologies such as classification, clustering, regression, and anomaly detection are examined, with a focus on algorithmic innovation and scalability. Furthermore, the text delves into state-of-the-art frameworks for data mining and predictive modeling, highlighting the role of neural networks, support vector machines, and ensemble methods in tackling complex analytical challenges. Special emphasis is placed on the convergence of big data with distributed computing paradigms, including cloud and edge computing, to address challenges in storage, computation, and real-time analytics. The integration of ethical considerations, including data privacy and compliance with global standards, ensures a holistic perspective on data management. Practical applications across healthcare, finance, marketing, and policy-making illustrate the real-world impact of these technologies. Through comprehensive case studies and Python-based implementations, this work equips researchers, practitioners, and data enthusiasts with the tools to navigate the complexities of modern data analytics. It bridges the gap between theory and practice, fostering the development of innovative solutions for managing and leveraging data in the era of artificial intelligence.
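The preprocessing steps this abstract lists (cleaning, normalization, and so on) can be sketched minimally. This is an illustrative plain-Python sketch, not code from the work itself; real pipelines would use pandas or scikit-learn, and the record layout here is hypothetical.

```python
# Two of the preprocessing steps the abstract mentions: data cleaning
# (here, dropping records with missing fields) and normalization
# (here, z-score standardization of a numeric column).
from statistics import mean, stdev

def clean(rows):
    """Drop records containing a missing (None) field -- a simple cleaning rule."""
    return [r for r in rows if None not in r]

def zscore(values):
    """Standardize a numeric column to zero mean and unit sample variance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

rows = [(1.0, 2.0), (None, 3.0), (3.0, 4.0)]   # hypothetical raw records
cleaned = clean(rows)                           # record with None is dropped
col0 = zscore([r[0] for r in cleaned])          # standardized first column
```

Production code would typically replace these helpers with `pandas.DataFrame.dropna` and scikit-learn's `StandardScaler`, which fit the statistics once and reuse them at inference time.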


The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities

Parthasarathy, Venkatesh Balavadhani, Zafar, Ahtsham, Khan, Aafaq, Shahid, Arsalan

arXiv.org Artificial Intelligence

This report examines the fine-tuning of Large Language Models (LLMs), integrating theoretical insights with practical applications. It outlines the historical evolution of LLMs from traditional Natural Language Processing (NLP) models to their pivotal role in AI. A comparison of fine-tuning methodologies, including supervised, unsupervised, and instruction-based approaches, highlights their applicability to different tasks. The report introduces a structured seven-stage pipeline for fine-tuning LLMs, spanning data preparation, model initialization, hyperparameter tuning, and model deployment. Emphasis is placed on managing imbalanced datasets and optimization techniques. Parameter-efficient methods like Low-Rank Adaptation (LoRA) and Half Fine-Tuning are explored for balancing computational efficiency with performance. Advanced techniques such as memory fine-tuning, Mixture of Experts (MoE), and Mixture of Agents (MoA) are discussed for leveraging specialized networks and multi-agent collaboration. The report also examines novel approaches like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), which align LLMs with human preferences, alongside pruning and routing optimizations to improve efficiency. Further sections cover validation frameworks, post-deployment monitoring, and inference optimization, with attention to deploying LLMs on distributed and cloud-based platforms. Emerging areas such as multimodal LLMs, fine-tuning for audio and speech, and challenges related to scalability, privacy, and accountability are also addressed. This report offers actionable insights for researchers and practitioners navigating LLM fine-tuning in an evolving landscape.
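The Low-Rank Adaptation (LoRA) idea the report covers can be sketched in a few lines: rather than updating a full d x d weight matrix W, learn two small factors B (d x r) and A (r x d) with r << d, and apply W' = W + (alpha / r) * B A. The plain-Python matrices below are purely illustrative; practical fine-tuning would use PyTorch and a library such as PEFT.

```python
# Minimal LoRA-style weight update: W' = W + (alpha / r) * B @ A.

def matmul(X, Y):
    """Naive matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_delta(B, A, alpha, r):
    """Scaled low-rank update (alpha / r) * B @ A."""
    return [[(alpha / r) * v for v in row] for row in matmul(B, A)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weights (d = 2)
B = [[1.0], [0.0]]             # d x r factor, rank r = 1
A = [[0.0, 2.0]]               # r x d factor
delta = lora_delta(B, A, alpha=1.0, r=1)
W_adapted = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
```

Only B and A (2d * r parameters) are trained, while W stays frozen, which is the source of LoRA's memory savings.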


Big Data - Supply Chain Management Framework for Forecasting: Data Preprocessing and Machine Learning Techniques

Jahin, Md Abrar, Shovon, Md Sakib Hossain, Shin, Jungpil, Ridoy, Istiyaque Ahmed, Tomioka, Yoichi, Mridha, M. F.

arXiv.org Machine Learning

This article intends to systematically identify and comparatively analyze state-of-the-art supply chain (SC) forecasting strategies and technologies. A novel framework has been proposed incorporating Big Data Analytics in SC Management (problem identification, data sources, exploratory data analysis, machine-learning model training, hyperparameter tuning, performance evaluation, and optimization), along with the effects of forecasting on the human workforce, inventory, and the overall SC. Initially, the need to collect data according to SC strategy, and how to collect them, has been discussed. The article discusses the need for different types of forecasting according to the period or SC objective. The SC KPIs and error-measurement systems have been recommended for optimizing the top-performing model. The adverse effects of phantom inventory on forecasting, and the dependence of managerial decisions on the SC KPIs for determining model performance parameters and improving operations management, transparency, and planning efficiency, have been illustrated. The cyclic connection within the framework introduces preprocessing optimization based on the post-process KPIs, optimizing the overall control process (inventory management, workforce determination, cost, production, and capacity planning). The contribution of this research lies in the standard SC process framework proposal, the recommended forecasting data analysis, the forecasting effects on SC performance, the machine-learning algorithm optimization followed, and in shedding light on future research.
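The error-measurement systems the framework recommends for ranking forecasting models can be sketched with two common metrics, MAPE and RMSE. Pure Python for illustration, with a hypothetical demand series; the article's own KPI choices may differ.

```python
# Two standard forecast-error metrics used to pick a top-performing model.
import math

def mape(actual, forecast):
    """Mean absolute percentage error (assumes no zero actuals)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

demand   = [100.0, 120.0, 80.0]    # hypothetical observed demand
forecast = [110.0, 115.0, 90.0]    # hypothetical model output
```

MAPE is scale-free and easy to communicate to managers, while RMSE penalizes large misses more heavily; the framework's cyclic KPI feedback would compare models on whichever metric matches the SC objective.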


Understand Data Preprocessing for Effective End-to-End Training of Deep Neural Networks

Gong, Ping, Ma, Yuxin, Li, Cheng, Ma, Xiaosong, Noh, Sam H.

arXiv.org Artificial Intelligence

In this paper, we primarily focus on understanding the data preprocessing pipeline for DNN training in the public cloud. First, we run experiments to test the performance implications of the two major data preprocessing methods, using either raw data or record files. The preliminary results show that data preprocessing is a clear bottleneck, even with the most efficient software and hardware configuration enabled by NVIDIA DALI, a highly optimized data preprocessing library. Second, we identify the potential causes, exercise a variety of optimization methods, and present their pros and cons. We hope this work will shed light on the new co-design of ``data storage, loading pipeline'' and ``training framework'' and flexible resource configurations between them, so that the resources can be fully exploited and performance can be maximized.
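One standard way to hide a preprocessing bottleneck of the kind this paper measures is to overlap preprocessing with training through a bounded prefetch queue. The stdlib producer/consumer sketch below illustrates the pattern only; the `decode` and training steps are stand-ins, not a real DALI pipeline.

```python
# Prefetching: a background thread preprocesses samples ahead of the
# consumer, so the training step does not wait on decode/augment work.
import queue
import threading

def decode(sample):
    return sample * 2              # stand-in for decode/augmentation work

def producer(samples, q):
    for s in samples:
        q.put(decode(s))           # blocks when the buffer is full
    q.put(None)                    # sentinel: end of data

q = queue.Queue(maxsize=4)         # bounded buffer of prefetched samples
threading.Thread(target=producer, args=(range(8), q)).start()

trained = []
while (batch := q.get()) is not None:
    trained.append(batch)          # stand-in for a training step
```

The bounded queue also caps memory use, which matters when preprocessing and training share a cloud instance's resources.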


Data Preprocessing in R for Data Science - Detechtor

#artificialintelligence

In Data Science, Data Preprocessing is a crucial part of building a Machine Learning model. Without it, our Machine Learning models will not work properly. Think of it like preparing a farm to plant crops: without proper preparation, we would have a difficult time planting, and it would negatively affect the crop yield. This is probably going to be the most boring part of this course, but once we are done with it we will have a smoother ride with the rest of the course.


Data Preprocessing with scikit-learn -- Missing Values

#artificialintelligence

By popular demand from my previous article, in this tutorial I illustrate how to preprocess data using scikit-learn, a Python library for machine learning. Data preprocessing transforms data into a format which is more suitable for estimators. In my previous articles I illustrated how to deal with missing values, normalization, standardization, formatting and binning with Python pandas. In this tutorial I show you how to deal with missing values with scikit-learn. For the other preprocessing techniques in scikit-learn, I will write other posts.
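The core idea of the tutorial, filling in missing values, can be sketched with a mean-imputation rule. Plain Python for illustration with a hypothetical column; scikit-learn's `SimpleImputer(strategy="mean")` applies the same rule to whole arrays.

```python
# Mean imputation: replace missing entries with the mean of the observed values.
from statistics import mean

def impute_mean(column):
    """Fill None entries with the mean of the non-missing values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

ages = [25.0, None, 35.0, None, 30.0]   # hypothetical column with gaps
```

Other common strategies (median, most-frequent, constant) follow the same fit-then-fill shape; the right choice depends on the column's distribution and outliers.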


How to Avoid Data Leakage in Data Preprocessing

#artificialintelligence

Avoid data leaking from the test set into the training set. “How to Avoid Data Leakage in Data Preprocessing” is published by Rukshan Pramoditha.
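The rule the article's title states amounts to: compute preprocessing statistics on the training split only, then apply them unchanged to the test split. A minimal sketch with min-max scaling and hypothetical values; the same discipline applies to any fitted transform.

```python
# No leakage: scaling statistics are fitted on train data only.

def fit_minmax(train):
    """Learn scaling statistics from the training split alone."""
    return min(train), max(train)

def transform(values, lo, hi):
    """Apply previously fitted statistics to any split."""
    return [(v - lo) / (hi - lo) for v in values]

train = [10.0, 20.0, 30.0]
test  = [40.0]                        # larger than any training value

lo, hi = fit_minmax(train)            # fitted on train only
scaled_test = transform(test, lo, hi) # may fall outside [0, 1] -- expected
```

Fitting on the full dataset instead would shift `lo` and `hi` using test information, quietly inflating evaluation scores; scikit-learn's `Pipeline` exists largely to make this fit-on-train / transform-on-test discipline automatic.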


The Importance of Data Preprocessing for Machine Learning in the E-Commerce Industry

#artificialintelligence

Big data, as the name suggests, are large volumes of varied data that travel at high velocity. Big data are bound to contain dirty data, as they are collected raw and unprocessed from various sources. Data preprocessing is the process of transforming raw data into an understandable format which is ready for analytical use. Machine Learning is a subset of artificial intelligence and an analytical application that makes decisions by receiving and analyzing data, without explicit programming. The e-commerce industry revolves around the application of technology to commercial business.


Data Preprocessing

#artificialintelligence

While working on Machine Learning and related fields, we often come across huge datasets. In order for this data to be used efficiently by the model, some preprocessing is required to make the data more structured. NOTE: There may be additional steps in between depending on the complexity of the dataset, but the steps mentioned here are standard for almost any dataset.


Data Preprocessing with Python Pandas -- Part 3 Normalisation

#artificialintelligence

This tutorial explains how to preprocess data using the Pandas library. Preprocessing transforms data into a standard, normalised format ahead of analysis. In this tutorial we deal only with normalisation. In my previous tutorials I dealt with missing values and data formatting. Data normalisation involves adjusting values measured on different scales to a common scale.
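The "common scale" idea can be sketched with min-max normalisation, which maps any numeric column onto [0, 1]. Plain Python with hypothetical columns for illustration; the pandas one-liner equivalent is `(s - s.min()) / (s.max() - s.min())` on a `Series`.

```python
# Min-max normalisation: rescale a column so its values span [0, 1].

def minmax(column):
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Two columns on very different scales...
heights_cm = [150.0, 175.0, 200.0]
incomes    = [30000.0, 55000.0, 80000.0]
# ...land on the same [0, 1] scale after normalisation:
# minmax(heights_cm) == minmax(incomes) == [0.0, 0.5, 1.0]
```

Putting features on a common scale prevents large-magnitude columns from dominating distance-based models such as k-nearest neighbours or k-means.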